Duplicate detection in the Reuters collection
Abstract
In a bibliographic database, the main task is not to find exact duplicate records but to find records that refer to the same work yet differ in some way. Such differences are typically due to inaccurate or inconsistent data entry. One detection method was developed by Ridley [Ridley 92], who adopted a two-stage technique. First, every record in the database was assigned a number generated by a hashing function that took fields of the bibliographic record as its input. Second, any records that shared the same hash number were examined in greater detail. This entailed comparing fields with customised processes: for example, the author field process looked for missing initials, and the title field process looked for a missing suffix. Detection techniques of this kind are supported by the work of O’Neill et al. [O’Neill 93], who manually examined duplicate bibliographic records to find which fields were most likely to differ.
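The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not Ridley's actual hashing function or comparison processes, which the abstract does not specify. The record layout (`author`, `title`, `year`), the "Surname, Initials" author format, the choice of fields fed to the hash, the use of MD5, and all function names are assumptions made only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def split_author(author):
    """Split an assumed 'Surname, I. J.' string into (surname, [initials])."""
    surname, _, rest = author.partition(",")
    initials = [part[0].lower() for part in rest.replace(".", " ").split()]
    return surname.strip().lower(), initials


def normalise_title(title):
    """Lower-case the title and collapse punctuation and extra whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def blocking_key(record):
    """Stage 1 (assumed key): hash the surname, the title start and the year."""
    surname, _ = split_author(record["author"])
    title_start = normalise_title(record["title"]).replace(" ", "")[:10]
    raw = f"{surname}|{title_start}|{record['year']}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def authors_match(a, b):
    """Stage 2, author field: same surname, and missing initials are tolerated."""
    surname_a, initials_a = split_author(a)
    surname_b, initials_b = split_author(b)
    if surname_a != surname_b:
        return False
    shorter, longer = sorted((initials_a, initials_b), key=len)
    return longer[:len(shorter)] == shorter


def titles_match(a, b):
    """Stage 2, title field: a missing suffix (e.g. a dropped subtitle) is tolerated."""
    shorter, longer = sorted((normalise_title(a), normalise_title(b)), key=len)
    return bool(shorter) and longer.startswith(shorter)


def find_duplicates(records):
    """Return index pairs that share a hash key and pass the field comparisons."""
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[blocking_key(record)].append(index)

    pairs = []
    for indices in buckets.values():
        # Only records with the same hash number are compared in detail.
        for i, j in combinations(indices, 2):
            if (authors_match(records[i]["author"], records[j]["author"])
                    and titles_match(records[i]["title"], records[j]["title"])):
                pairs.append((i, j))
    return pairs


# Hypothetical records, invented for this example.
records = [
    {"author": "Smith, J. A.", "title": "Information Retrieval: An Introduction", "year": 1990},
    {"author": "Smith, J.", "title": "Information Retrieval", "year": 1990},
    {"author": "Jones, P.", "title": "Information Retrieval", "year": 1990},
]
print(find_duplicates(records))
```

The hash in the first stage keeps the expensive field-by-field comparison confined to small groups of candidate records. The trade-off is that a data-entry error in a field used for the key would place true duplicates in different groups, which is why knowing which fields are most likely to differ matters when choosing the key.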
Similar resources
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the correct operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...
Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection
There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however it is still uncommon for methods to be objectively compared with each other, probably because of a lack of any good framework in ...
Reuters test collection Saturday, 11 June, 1994
This short paper presents the little known Reuters 22,173 test collection, which is significantly larger than most traditional test collections. In addition, Reuters has none of the recall calculation problems normally associated with some of the larger test collections now available. This paper explains the method (derived from Lewis [Lewis 91]) used to perform retrieval experiments on the Reu...
Quantifying the Specificity of Near-duplicate Image Classification Functions
There are many published methods for detecting similar and near-duplicate images. Here, we consider their use in the context of unsupervised near-duplicate detection, where the task is to find a (relatively small) near-duplicate intersection of two large candidate sets. Such scenarios are of particular importance in forensic near-duplicate detection. The essential properties of such a function...
Publication date: 1997